Lending Club Case Study¶
Introduction¶
As an employee of a consumer finance company specializing in lending various types of loans to urban clients, part of my responsibility involves facilitating loan approval decision-making. This entails evaluating application profiles and identifying potential risks associated with loan repayment. To accomplish this task, I need to analyze the data provided in "loan.csv", which contains historical information about past loan applicants along with their default status. The goal is to identify patterns that indicate the likelihood of an applicant defaulting, enabling us to take appropriate actions such as denying a loan, adjusting loan terms, or applying higher interest rates to risky applicants.¶
Through this analysis, my objective is to gain insights into the consumer and loan attributes that influence the likelihood of default, as well as to identify the key driving factors or variables behind loan defaults. By understanding these factors, the company can improve its portfolio management and risk assessment strategies.¶
Problem Statement¶
The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.¶
When a person applies for a loan, there are two types of decisions that could be taken by the company:¶
1. Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:¶
- Fully paid: Applicant has fully paid the loan (the principal and the interest rate)¶
- Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.¶
- Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan¶
2. Loan rejected: The company had rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history of those applicants with the company and so this data is not available with the company (and thus in this dataset)¶
Preliminary Wrangling¶
Importing essential libraries and configuring settings for data analysis and visualization¶
Gathering¶
Number of Rows in loan.csv : 39717 Number of columns in loan.csv : 111
DataSet : Loan.csv
Out[5]:
| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | url | desc | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_il_6m | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | NaN | 10+ years | RENT | 24000.0 | Verified | Dec-11 | Fully Paid | n | https://lendingclub.com/browse/loanDetail.action?loan_id=1077501 | Borrower added on 12/22/11 > I need to upgrade my business technologies.<br> | credit_card | Computer | 860xx | AZ | 27.65 | 0 | Jan-85 | 1 | NaN | NaN | 3 | 0 | 13648 | 83.70% | 9 | f | 0.0 | 0.0 | 5863.155187 | 5833.84 | 5000.0 | 863.16 | 0.0 | 0.0 | 0.0 | Jan-15 | 171.62 | NaN | May-16 | 0.0 | NaN | 1 | INDIVIDUAL | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
Number of Rows in Data_Dictionary.xlsx : 115 Number of columns in Data_Dictionary.xlsx : 2
DataSet : Data_Dictionary.xlsx
Out[7]:
| LoanStatNew | Description | |
|---|---|---|
| 0 | acc_now_delinq | The number of accounts on which the borrower is now delinquent. |
Observation¶
The dataset comprises roughly 39,717 loan records and encompasses about 111 columns, which include both consumer and loan data. Furthermore, the data dictionary contains approximately 115 entries and is structured with 2 columns: one for the column names and another for their respective descriptions.¶
Assessing¶
Observation¶
1. The dataset does not contain any duplicate rows.¶
2. Approximately 54 columns have all NULL values.¶
3. About 9 columns have the same values across all records.¶
4. The dataset comprises approximately 7 categorical variables: 'term', 'grade', 'sub_grade', 'verification_status', 'loan_status', 'purpose', and 'home_ownership'.¶
Cleaning¶
1) Remove all the columns that are being used post loan approval.¶
2) Identify and remove rows where the loan status == "Current".¶
3) Remove the columns with all NULL or NaN values.¶
4) Remove the textual or masked columns that is irrelevant to analysis.¶
5) Remove all columns that have identical values across all rows.¶
6) Cleanse the data within columns containing '%' symbols.¶
7) Strip the alphabets from the sub-grade column.¶
8) Standardize the values within the emp_length column.¶
9) Round-off the amount value/ interest rate to the nearest two decimal places.¶
10) Convert the data-type of date columns to appropriate date formats.¶
11) Convert the cleaned '%' data in columns to float data type.¶
12) Convert the data type of columns having categorical value to categorical data types.¶
13) Decompose the date columns into smaller units like month and year.¶
14) Derive categorical variables from the loan_amnt and int_rate.¶
15) Rename the columns for clarity by using full terms instead of abbreviations.¶
16) Address the missing values in the dataset through imputation or deletion.¶
17) Handle the outliers in the data.¶
Number of Rows in df_clean after Cleaning : 37485 Number of columns in df_clean after Cleaning: 23
Below is the data dictionary for the remaining columns on which we will conduct the analysis.¶
| LoanStatNew | Description | |
|---|---|---|
| 0 | addr_state | The state provided by the borrower in the loan application |
| 1 | annual_inc | The self-reported annual income provided by the borrower during registration. |
| 2 | dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| 3 | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| 4 | funded_amnt | The total amount committed to that loan at that point in time. |
| 5 | funded_amnt_inv | The total amount committed by investors for that loan at that point in time. |
| 6 | grade | LC assigned loan grade |
| 7 | home_ownership | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER. |
| 8 | installment | The monthly payment owed by the borrower if the loan originates. |
| 9 | int_rate | Interest Rate on the loan |
| 10 | issue_d | The month which the loan was funded |
| 11 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| 12 | loan_status | Current status of the loan |
| 13 | pub_rec_bankruptcies | Number of public record bankruptcies |
| 14 | purpose | A category provided by the borrower for the loan request. |
| 15 | sub_grade | LC assigned loan subgrade |
| 16 | term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| 17 | verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified |
Below is the segregation of Customer and Loan attributes post Data Assessment and Cleaning¶
Loan Attributes¶
1. term → Categorical Data Type¶
2. issue_d → DateTime Data Type¶
3. grade → Categorical Data Type¶
4. sub_grade → Categorical Data Type¶
5. verification_status → Categorical Data Type¶
6. loan_status → Categorical Data Type¶
7. purpose → Categorical Data Type¶
8. loan_amnt → Float Data Type¶
9. funded_amnt → Float Data Type¶
10. funded_amnt_inv → Float Data Type¶
11. int_rate → Float Data Type¶
12. installment → Float Data Type¶
Observation:¶
Following the cleaning process, the dataset now contains 37,170 records and 23 columns. This reflects a reduction of approximately 6.41% in the number of records and an 74% reduction in the number of columns with 5 new derived columns. Subsequently, we will employ this refined dataset for our Exploratory Data Analysis (EDA), encompassing univariate, segmented univariate, bivariate, and multivariate analyses.¶
Exploratory Data Analisys¶
Numerical_Columns : ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'installment', 'annual_inc', 'debt_to_income', 'pub_rec_bankruptcies', 'issue_d_year'] Cateogrical_Columns : ['term', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'loan_status', 'issue_d_month', 'loan_amnt_b', 'int_rate_b', 'debt_to_income_b'] Extra_Columns : ['issue_d', 'purpose', 'addr_state']
Univariate Exploration¶
Initiating detailed analysis of loan_amnt... Statistical summary for loan_amnt: count 37485.000000 mean 10678.593437 std 6884.877579 min 500.000000 25% 5000.000000 50% 9500.000000 75% 15000.000000 max 35000.000000 Name: loan_amnt, dtype: float64 The mode of loan_amnt is: 10000.0
Initiating detailed analysis of annual_inc... Statistical summary for annual_inc: count 37485.00000 mean 63517.37955 std 32070.74777 min 4000.00000 25% 40000.00000 50% 57000.00000 75% 80000.00000 max 189000.00000 Name: annual_inc, dtype: float64 The mode of annual_inc is: 60000.0
Initiating detailed analysis of int_rate... Statistical summary for int_rate: count 37485.000000 mean 11.878059 std 3.661481 min 5.420000 25% 8.940000 50% 11.710000 75% 14.270000 max 24.400000 Name: int_rate, dtype: float64 The mode of int_rate is: 10.99
Initiating detailed analysis of issue_d_year... Statistical summary for issue_d_year: count 37485.000000 mean 2010.303188 std 0.883444 min 2007.000000 25% 2010.000000 50% 2011.000000 75% 2011.000000 max 2011.000000 Name: issue_d_year, dtype: float64 The mode of issue_d_year is: 2011 Count summary for issue_d_year: issue_d_year 2011 19775 2010 11318 2009 4614 2008 1538 2007 240 Name: count, dtype: int64
Initiating detailed analysis of term... Statistical summary for term: count 37485 unique 2 top 36 months freq 28417 Name: term, dtype: object The mode of term is: 36 months Count summary for term: term 36 months 28417 60 months 9068 Name: count, dtype: int64
Initiating detailed analysis of grade... Statistical summary for grade: count 37485 unique 7 top B freq 11394 Name: grade, dtype: object The mode of grade is: B Count summary for grade: grade B 11394 A 9873 C 7615 D 4924 E 2508 F 900 G 271 Name: count, dtype: int64
Initiating detailed analysis of sub_grade... Statistical summary for sub_grade: count 37485 unique 5 top 4 freq 7865 Name: sub_grade, dtype: object The mode of sub_grade is: 4 Count summary for sub_grade: sub_grade 4 7865 3 7743 5 7598 2 7434 1 6845 Name: count, dtype: int64
Initiating detailed analysis of emp_length... Statistical summary for emp_length: count 37485 unique 11 top 10 freq 9131 Name: emp_length, dtype: object The mode of emp_length is: 10 Count summary for emp_length: emp_length 10 9131 0 4414 2 4184 3 3924 4 3247 5 3113 1 3110 6 2114 7 1671 8 1377 9 1200 Name: count, dtype: int64
Initiating detailed analysis of home_ownership... Statistical summary for home_ownership: count 37485 unique 5 top RENT freq 18240 Name: home_ownership, dtype: object The mode of home_ownership is: RENT Count summary for home_ownership: home_ownership RENT 18240 MORTGAGE 16241 OWN 2906 OTHER 95 NONE 3 Name: count, dtype: int64
Initiating detailed analysis of verification_status... Statistical summary for verification_status: count 37485 unique 3 top Not Verified freq 16528 Name: verification_status, dtype: object The mode of verification_status is: Not Verified Count summary for verification_status: verification_status Not Verified 16528 Verified 11558 Source Verified 9399 Name: count, dtype: int64
Initiating detailed analysis of loan_status... Statistical summary for loan_status: count 37485 unique 2 top Fully Paid freq 32026 Name: loan_status, dtype: object The mode of loan_status is: Fully Paid Count summary for loan_status: loan_status Fully Paid 32026 Charged Off 5459 Name: count, dtype: int64
Initiating detailed analysis of issue_d_month... Statistical summary for issue_d_month: count 37485 unique 12 top December freq 4101 Name: issue_d_month, dtype: object The mode of issue_d_month is: December Count summary for issue_d_month: issue_d_month December 4101 November 3876 October 3613 September 3390 August 3285 July 3272 June 3093 May 2848 April 2749 March 2621 January 2332 February 2305 Name: count, dtype: int64
Initiating detailed analysis of loan_amnt_b... Statistical summary for loan_amnt_b: count 37485 unique 7 top 5000-10000 freq 12677 Name: loan_amnt_b, dtype: object The mode of loan_amnt_b is: 5000-10000 Count summary for loan_amnt_b: loan_amnt_b 5000-10000 12677 0-5000 9411 10000-15000 7553 15000-20000 4228 20000-25000 2657 25000-30000 643 30000-35000 316 Name: count, dtype: int64
Initiating detailed analysis of int_rate_b... Statistical summary for int_rate_b: count 37485 unique 11 top 10-12 freq 8146 Name: int_rate_b, dtype: object The mode of int_rate_b is: 10-12 Count summary for int_rate_b: int_rate_b 10-12 8146 12-14 7073 6-8 6759 14-16 5096 8-10 4003 16-18 3037 18-20 1362 4-6 1335 20-22 546 22-24 124 24-26 4 0-2 0 2-4 0 Name: count, dtype: int64
Initiating detailed analysis of debt_to_income_b... Statistical summary for debt_to_income_b: count 37485 unique 15 top 12-14 freq 3809 Name: debt_to_income_b, dtype: object The mode of debt_to_income_b is: 12-14 Count summary for debt_to_income_b: debt_to_income_b 12-14 3809 14-16 3805 10-12 3621 16-18 3472 8-10 3363 18-20 3321 6-8 2903 20-22 2900 22-24 2555 4-6 2490 2-4 1865 0-2 1728 24-26 1188 26-28 277 28-30 188 Name: count, dtype: int64
Observation:¶
1. debt_to_income_b: It is a normally distributed graph with peak between 12 and 14 which indicates that the most of customers had a debt-to-income ratio within this range.¶
2. loan_amount_b: The graph indicates a right-skewed distribution with the peak indicating that most of loans are for amounts between 5,000 and 10,000.¶
3. int_rate_b: The distribution indicates that most of loans are between 10%-12%.¶
4. issue_b_month: The distribution is left-skewed, which indicates that most loan applications were issued in December,and the lowest in February.¶
5. loan_status: There are around 32,026 customers who have their loan status as fully paid while around 5,459 customers have loan status as charged-off.¶
6. verification_status: Around 16,528 customers who have been issued loans arent verified.¶
7. home_ownership: The majority of customers (around 18,240) have a home ownership status of "RENT," followed by 16,241 with "MORTGAGE," and only 2,906 who own their home.¶
8. emp_length: Customers with over 10 years of employment have applied for the most loans and that is around 9,131 applications.¶
9. sub_grade: Loans issued under subgrade 4 are the most common, while subgrade 1 has the least.¶
10. grade: The highest number of loans are issued under grade B (around 11,394), while the lowest is for grade G.¶
11. term: The majority of loans have a term of 36 months.¶
Segmented Univariate Exploration¶
Segmenting the loan status into 'fully_paid' and 'charged_off' and analyzing the variables that impact the loan status.¶
Initiating detailed analysis of int_rate... Statistical summary for int_rate: count 32026.000000 mean 11.561643 std 3.575163 min 5.420000 25% 8.490000 50% 11.490000 75% 13.980000 max 24.110000 Name: int_rate, dtype: float64 The mode of int_rate is: 10.99
Initiating detailed analysis of int_rate... Statistical summary for int_rate: count 5459.000000 mean 13.734362 std 3.611106 min 5.420000 25% 11.140000 50% 13.490000 75% 16.320000 max 24.400000 Name: int_rate, dtype: float64 The mode of int_rate is: 11.49
Initiating detailed analysis of annual_inc... Statistical summary for annual_inc: count 32026.000000 mean 64411.371832 std 32263.098470 min 4000.000000 25% 40500.000000 50% 58000.000000 75% 80000.000000 max 189000.000000 Name: annual_inc, dtype: float64 The mode of annual_inc is: 60000.0
Initiating detailed analysis of annual_inc... Statistical summary for annual_inc: count 5459.000000 mean 58272.646664 std 30395.899329 min 4080.000000 25% 36500.000000 50% 52000.000000 75% 72000.000000 max 187000.000000 Name: annual_inc, dtype: float64 The mode of annual_inc is: 60000.0
Initiating detailed analysis of loan_amnt... Statistical summary for loan_amnt: count 32026.000000 mean 10520.196871 std 6760.853423 min 500.000000 25% 5000.000000 50% 9000.000000 75% 14500.000000 max 35000.000000 Name: loan_amnt, dtype: float64 The mode of loan_amnt is: 10000.0
Initiating detailed analysis of loan_amnt... Statistical summary for loan_amnt: count 5459.000000 mean 11607.849423 std 7505.301274 min 900.000000 25% 5500.000000 50% 10000.000000 75% 16000.000000 max 35000.000000 Name: loan_amnt, dtype: float64 The mode of loan_amnt is: 10000.0
Initiating detailed analysis of int_rate_b... Statistical summary for int_rate_b: count 32026 unique 11 top 10-12 freq 7097 Name: int_rate_b, dtype: object The mode of int_rate_b is: 10-12
Initiating detailed analysis of int_rate_b... Statistical summary for int_rate_b: count 5459 unique 11 top 12-14 freq 1141 Name: int_rate_b, dtype: object The mode of int_rate_b is: 12-14
Initiating detailed analysis of loan_amnt_b... Statistical summary for loan_amnt_b: count 32026 unique 7 top 5000-10000 freq 11046 Name: loan_amnt_b, dtype: object The mode of loan_amnt_b is: 5000-10000
Initiating detailed analysis of loan_amnt_b... Statistical summary for loan_amnt_b: count 5459 unique 7 top 5000-10000 freq 1631 Name: loan_amnt_b, dtype: object The mode of loan_amnt_b is: 5000-10000
Initiating detailed analysis of grade... Statistical summary for grade: count 32026 unique 7 top B freq 9998 Name: grade, dtype: object The mode of grade is: B
Initiating detailed analysis of grade... Statistical summary for grade: count 5459 unique 7 top B freq 1396 Name: grade, dtype: object The mode of grade is: B
Initiating detailed analysis of emp_length... Statistical summary for emp_length: count 32026 unique 11 top 10 freq 7644 Name: emp_length, dtype: object The mode of emp_length is: 10
Initiating detailed analysis of emp_length... Statistical summary for emp_length: count 5459 unique 11 top 10 freq 1487 Name: emp_length, dtype: object The mode of emp_length is: 10
Initiating detailed analysis of verification_status... Statistical summary for verification_status: count 32026 unique 3 top Not Verified freq 14407 Name: verification_status, dtype: object The mode of verification_status is: Not Verified
Initiating detailed analysis of verification_status... Statistical summary for verification_status: count 5459 unique 3 top Not Verified freq 2121 Name: verification_status, dtype: object The mode of verification_status is: Not Verified
Initiating detailed analysis of home_ownership... Statistical summary for home_ownership: count 32026 unique 5 top RENT freq 15454 Name: home_ownership, dtype: object The mode of home_ownership is: RENT
Initiating detailed analysis of home_ownership... Statistical summary for home_ownership: count 5459 unique 4 top RENT freq 2786 Name: home_ownership, dtype: object The mode of home_ownership is: RENT
Observations¶
1. Interest Rate (int_rate): Loans which have interest rates between 10%-12% have a higher chance of loan status being fully paid off. However, loans having interest rates between 12%-14% show a higher likelihood of charge-off.¶
2. Annual Income (annual_inc): Loans Status fully paid when the customer's annual income is between 45K-60K. In contrast, the customers having an annual income between 30K-45K tend to have more charged-off loans.¶
3. Grade: Borrowers with 10 or more years of employment have the highest count of loans and also is the highest in fully paid and charged-off segments. This indicates that they take more loans and are also prone to default more.¶
4. Employment Length (emp_length): Borrowers having 10 or more years of employment have the highest loan counts in fully paid and default segments. This indicates they take more loans and also default more.¶
5. Verification Status: Loans to borrowers who are not verified show higher peaks in being paid off compared to those who are verified. This indicates that unverified borrowers tend to repay their loans more frequently.¶
6. Home Ownership: Renters have a higher default rate, which might be explained by their higher expenses towards rent.¶
Bivariate Exploration¶
Bivariate Analysis - Numerical vs Numerical¶
Observations :¶
1. Loan Amount vs. Interest Rate:¶
- Slight positive correlation: Higher loan amounts correlate with slightly lower interest rates.¶
2. Loan Amount vs. Installment:¶
- Positive correlation: Larger loans entail higher installment payments.¶
3. Loan Amount vs. Annual Income:¶
- Weak positive correlation: As annual income increases, loan amount tends to rise, albeit weakly.¶
4. Loan Amount vs. Public Record Bankruptcies:¶
- Negative correlation: Public record bankruptcies minimally affect the loan amount qualification.¶
5. Annual Income vs. Interest Rate:¶
- Weak negative correlation: Drawing conclusions from this scatter plot alone is challenging due to the weak correlation.¶
6. Annual Income vs. Debt-to-Income Ratio:¶
- Negative correlation: As annual income increases, the debt-to-income ratio tends to decrease.¶
Bivariate Analysis - Categorical vs Numerical¶
Observations:¶
1. Term vs Loan Amount:¶
Loans with a 60-month term have higher average amounts as compared to those with a 36-month term.¶
2. Grade vs Loan Amount:¶
- Median of the loan amounts increases with higher grades (A to G), that indicates a very strong positive correlation.¶
- Grades C, D, and E exhibit wider variety in loan amounts when compared to other grades.¶
4. Employment Length vs. Loan Amount:¶
There is no clear correlation between employment length and the loan amount. Variety in loan amounts differs across employment lengths.¶
4. Loan Status vs. Loan Amount:¶
Median of the loan amounts is almost same for charged-off and fully paid loans, but charged-off loans exhibit greater variety.¶
5. Loan Status vs. Interest Rate:¶
Charged-off loans have higher median interest rates than the fully paid loans, thereby suggesting a potential correlation between interest rate and loan default.¶
7. Grade vs. Interest Rate:¶
- Median of the interest rate increases with lower loan grades (A to G), indicating that riskier loans carry higher rates.¶
- Some grades (e.g., B, C, D, and E) show wider variety in interest rates.¶
7. Verification Status vs. Loan Amount:¶
There is no much difference in median income of verified and unverified customers. Unverified applicants show wider variety in income.¶
9. Home Ownership vs. Loan Amount:¶
Median of the loan amounts is high for customers with mortgages as compared to renters and others, suggesting a correlation between homeownership and the loan amount.¶
Bivariate Analysis - Categorical vs Categorical¶
Category: term
| term | loan_status | 36 months | 60 months |
|---|---|---|---|
| 0 | Charged Off | 11.15 | 25.25 |
| 1 | Fully Paid | 88.85 | 74.75 |
Category: loan_amnt_b
| loan_amnt_b | loan_status | 0-5000 | 5000-10000 | 10000-15000 | 15000-20000 | 20000-25000 | 25000-30000 | 30000-35000 |
|---|---|---|---|---|---|---|---|---|
| 0 | Charged Off | 13.91 | 12.87 | 13.88 | 17.43 | 19.5 | 21.31 | 25.0 |
| 1 | Fully Paid | 86.09 | 87.13 | 86.12 | 82.57 | 80.5 | 78.69 | 75.0 |
Category: int_rate_b
| int_rate_b | loan_status | 0-2 | 2-4 | 4-6 | 6-8 | 8-10 | 10-12 | 12-14 | 14-16 | 16-18 | 18-20 | 20-22 | 22-24 | 24-26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Charged Off | NaN | NaN | 4.12 | 5.65 | 9.59 | 12.88 | 16.13 | 19.35 | 26.11 | 30.4 | 36.26 | 44.35 | 50.0 |
| 1 | Fully Paid | NaN | NaN | 95.88 | 94.35 | 90.41 | 87.12 | 83.87 | 80.65 | 73.89 | 69.6 | 63.74 | 55.65 | 50.0 |
Category: grade
| grade | loan_status | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|---|
| 0 | Charged Off | 6.07 | 12.25 | 17.35 | 22.22 | 26.71 | 32.11 | 33.21 |
| 1 | Fully Paid | 93.93 | 87.75 | 82.65 | 77.78 | 73.29 | 67.89 | 66.79 |
Category: home_ownership
| home_ownership | loan_status | MORTGAGE | NONE | OTHER | OWN | RENT |
|---|---|---|---|---|---|---|
| 0 | Charged Off | 13.68 | 0.0 | 17.89 | 14.97 | 15.27 |
| 1 | Fully Paid | 86.32 | 100.0 | 82.11 | 85.03 | 84.73 |
Category: purpose
| purpose | loan_status | car | credit_card | debt_consolidation | educational | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | wedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Charged Off | 10.82 | 10.76 | 15.28 | 16.93 | 12.21 | 15.76 | 10.43 | 15.81 | 15.96 | 16.22 | 18.56 | 27.3 | 14.25 | 10.35 |
| 1 | Fully Paid | 89.18 | 89.24 | 84.72 | 83.07 | 87.79 | 84.24 | 89.57 | 84.19 | 84.04 | 83.78 | 81.44 | 72.7 | 85.75 | 89.65 |
Observations:¶
1. Comparison between Loan Status and Term:¶
Risk Levels (Charged Off Rates):¶
Distribution of Loan Terms:¶
2. Comparison between Loan Status and Loan Amount:¶
Risk Levels (Charged Off Rates):¶
Less Risky: 5000-10000 (12.87%), 0-5000 (13.91%), 10000-15000 (13.88%)¶
More Risky: 15000-20000 (17.43%), 20000-25000 (19.5%), 25000-30000 (21.31%)¶
Distribution of Loan Amounts:¶
More Prevalent: Loans in the 5000-10000 and 0-5000 ranges¶
Less Prevalent: Loans in the 25000-30000 and 20000-25000 ranges, indicating fewer larger loans.¶
3. Comparison between Loan Status and Interest Rate:¶
Risk Levels (Charged Off Rates):¶
Less Risky: 4-6% (4.12%), 6-8% (5.65%), 8-10% (9.59%)¶
More Risky: 16-18% (26.11%), 18-20% (30.4%), 20-22% (36.26%), 22-24% (44.35%)¶
Distribution of Interest Rates:¶
More Common: Loans with 10-12% and 6-8% interest rates¶
Less Common: Loans with 22-24% and 20-22% interest rates, indicating fewer high-risk, high-interest loans.¶
4. Comparison between Loan Status and Grade:¶
Risk Levels (Charged Off Rates):¶
Less Risky: Grade A (6.07%), B (12.25%), and C (17.35%)¶
More Risky: Grade D (22.22%), E (26.71%), F (32.11%), and G (33.21%)¶
Distribution of Loan Grades:¶
5. Comparison between Loan Status and Home Ownership:¶
Risk Levels (Charged Off Rates):¶
More Risky: OTHER (17.89%), RENT (15.27%), OWN (14.97%)¶
Less Risky: MORTGAGE (13.68%), NONE (0.0%)¶
Distribution of Home Ownership:¶
High Prevalence: RENT and MORTGAGE are most common.¶
6. Comparison between Loan Status and Purpose:¶
Risk:¶
High Risk: Small business, renewable energy, educational loans.¶
Low Risk: Wedding, car, credit card, major purchase loans.¶
Demand:¶
Multivariate Analysis¶
Observation:¶
1. Loan Amounts:¶
Strong positive correlations with funded amount and funded amount invested, moderate correlation with installment, and weak correlation with borrower characteristics and interest rates.¶
2. Funded Amount:¶
Strong positive correlations with loan amount and funded amount invested, moderate correlation with installment, and weak correlation with borrower characteristics and interest rates.¶
3. Funded Amount Invested:¶
Strong positive correlations with loan amount and funded amount, moderate correlation with installment, and weak correlation with borrower characteristics and interest rates.¶
4. Interest Rate:¶
Weak correlations with loan attributes, indicating some association but not particularly strong.¶
5. Installment:¶
Strong positive correlations with loan amount, funded amount, and funded amount invested, weak correlation with borrower characteristics and interest rates.¶
6. Annual Income:¶
Weak correlations with loan attributes, indicating a slight association with loan amounts but not particularly strong.¶
7. Debt-to-Income Ratio:¶
Weak correlations with loan attributes, suggesting some association but not particularly strong.¶
8. Public Record Bankruptcies:¶
Almost negligible correlation with loan attributes, indicating minimal association with borrower characteristics and loan terms.¶
9. Issue Date Year:¶
Negligible correlation with loan attributes except for a somewhat stronger correlation with funded amount invested.¶
Recommendations¶
Based on the above analysis, below are the factors that influence loan defaults:¶
Loan Term: Loans with longer terms (60 months) have higher default rates compared to shorter-term loans (36 months), indicating higher risk or borrower instability.¶
Loan Amount: For loan amounts exceeding $15,000, there is a trend of higher default rates. Borrowers may find it difficult to make repayments for larger loans, leading to increased charge-offs.¶
Interest Rate: Loans with higher interest rates, such as 16%, are associated with higher charge-off rates. High-interest loans may attract riskier customers.¶
Loan Grade: Lower-grade loans (D, E, F, G) have higher charge-off rates compared to higher-grade loans (A, B, C).¶
Home Ownership: Non-traditional home ownership, such as renting or other categories, is linked with higher charge-off rates compared to mortgage holders.¶
Loan Purpose: Loans for small businesses, renewable energy, and education exhibit higher charge-off rates, while loans for weddings, cars, credit cards, and major purchases have lower charge-off rates.¶